Kestrel-3: Microcoding vs. Pipelining

While working on a design that will eventually become the KCP53010 processor for the Kestrel-3, I happened upon something I had not seen documented elsewhere: a difference in how microcoded and pipelined architectures are designed. As a first-order approximation, you could consider it the difference between CISC and RISC microarchitectures; that is not exactly accurate, however, since you can have pipelined CISC architectures and microcoded RISC machines as well.

Microcoded Microarchitectures

With a microcoded microarchitecture, you first need to lay out what operations you want your user to be able to perform. Historically, this is the processor's instruction set architecture. Consider the simple act of loading a register from memory on a 6502 processor. First, you must resolve the effective address from which to load the data. Then you fetch from that address and store the result in the appropriate register. But there is an asymmetry in the ISA's design. Because the 6502 uses only 8 bits to encode the operation, and because this space is shared with a bunch of other primitives, different sets of addressing modes apply to different registers. Moreover, the final value can be committed to the register at a variety of different points in time, depending on the addressing mode selected. The end result is a bunch of different yet related micro-programs.

    Operation: LDA/LDX/LDY #nn (Load R immediate)
    Clock Micro-step
      1   Present PC to memory address bus, and select a read operation.  Wait for bus ready.
      2   Present data bus to result bus; select write-enable for appropriate register.
          Increment PC.
      3   Commence next instruction fetch.

    Operation: LDA/LDX/LDY $nnnn (Load R absolute)
    Clock Micro-step
      1   Present PC to memory address bus, and select a read operation.  Wait for bus ready.
      2   Present data bus to temp pointer register low; write-enable temp pointer low.
          Increment PC.
      3   Present PC to memory address bus, and select a read operation.  Wait for bus ready.
      4   Present data bus to address bus high.
          Present temp pointer low register to address bus low.
          Select a read operation.
          Wait for bus ready.
          Increment PC.
      5   Present data bus to result bus; select write-enable for appropriate register.
      6   Commence next instruction fetch.

And so on; you get the idea. This approach is very much like writing a conventional program.
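To make the "conventional program" analogy concrete, here is a minimal Python sketch that treats each of the two micro-programs above as a list of step functions run by a sequencer, one step per clock. The `Cpu` class, the step names, and the flat memory are illustrative assumptions of mine, not a description of how the 6502 (or the KCP53010) is actually built. Bus waits are ignored, and PC is assumed to already point at the operand bytes.

```python
# Hypothetical model: a micro-program is a list of steps, one per clock.
class Cpu:
    def __init__(self, memory):
        self.mem = memory            # flat, byte-addressable memory
        self.pc = 0                  # program counter (points at the operand)
        self.regs = {"A": 0, "X": 0, "Y": 0}
        self.data = 0                # value currently on the data bus
        self.tmp_lo = 0              # temp pointer register, low byte

def read_at_pc(cpu, reg):
    cpu.data = cpu.mem[cpu.pc]       # present PC to the address bus, read

def load_result(cpu, reg):
    cpu.regs[reg] = cpu.data         # data bus -> result bus -> register

def load_result_inc_pc(cpu, reg):
    load_result(cpu, reg)
    cpu.pc += 1

def latch_low_inc_pc(cpu, reg):
    cpu.tmp_lo = cpu.data            # data bus -> temp pointer low
    cpu.pc += 1

def read_absolute_inc_pc(cpu, reg):
    addr = (cpu.data << 8) | cpu.tmp_lo   # data bus supplies the high byte
    cpu.data = cpu.mem[addr]
    cpu.pc += 1

# The final "commence next instruction fetch" step is left out of both lists.
LOAD_IMMEDIATE = [read_at_pc, load_result_inc_pc]
LOAD_ABSOLUTE = [read_at_pc, latch_low_inc_pc, read_at_pc,
                 read_absolute_inc_pc, load_result]

def run(cpu, microprogram, reg):
    """Execute every step in order; return the number of clocks consumed."""
    for step in microprogram:
        step(cpu, reg)
    return len(microprogram)
```

For instance, with the operand bytes $34 $12 at address 0 and the value $5A stored at $1234, `run(cpu, LOAD_ABSOLUTE, "A")` takes five steps and leaves $5A in A with PC advanced by two.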

NOTE: The 6502 was designed even more cleverly than what I described above; however, the end result is essentially the same.

So, the end result is a single, monolithic block of instruction decode logic that breaks an instruction down into fields and recognizes cycles within those patterns. For example, if the opcode bits match the pattern for LDA and we're in cycle 2, then assert these control signals. If the opcode bits match the pattern for STA and we're in cycle 4, then assert those control signals instead. And so forth.
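One way to picture such a decoder is as a lookup keyed on (opcode, cycle). The Python sketch below is only an illustration: the control-signal names are invented, and only the two load micro-programs listed earlier are filled in, using the 6502 opcodes for LDA #nn ($A9) and LDA $nnnn ($AD). Real decode logic would match bit patterns rather than whole opcodes, so that LDA, LDX, and LDY could share entries.

```python
# Hypothetical control-signal names; real hardware would drive wires instead.
DECODE = {
    # (opcode, cycle): signals asserted during that cycle
    (0xA9, 1): frozenset({"ADDR_FROM_PC", "MEM_READ"}),
    (0xA9, 2): frozenset({"RESULT_FROM_DATA", "WRITE_A", "INC_PC"}),
    (0xA9, 3): frozenset({"FETCH_NEXT"}),
    (0xAD, 1): frozenset({"ADDR_FROM_PC", "MEM_READ"}),
    (0xAD, 2): frozenset({"TMP_LO_FROM_DATA", "WRITE_TMP_LO", "INC_PC"}),
    (0xAD, 3): frozenset({"ADDR_FROM_PC", "MEM_READ"}),
    (0xAD, 4): frozenset({"ADDR_HI_FROM_DATA", "ADDR_LO_FROM_TMP",
                          "MEM_READ", "INC_PC"}),
    (0xAD, 5): frozenset({"RESULT_FROM_DATA", "WRITE_A"}),
    (0xAD, 6): frozenset({"FETCH_NEXT"}),
}

def control_signals(opcode, cycle):
    """Return the signals to assert; empty for combinations not in the table."""
    return DECODE.get((opcode, cycle), frozenset())
```

The point of the sketch is the shape of the table: every instruction contributes its own rows, and the cycle counter is part of the key.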

Pipelined Microarchitectures

When building a pipelined microarchitecture, however, your approach is inverted, at least in some sense. You are attempting to find a sequence of processing steps that applies to as many instructions as possible. Then, for each step, you need to isolate which instructions exploit that stage's features. For example, most RISC instructions follow a fairly generic template of three steps: (1) fetch operands from the appropriate sources, (2) transform them according to the instruction's requirements, and (3) write the result back. Some instructions do require a few additional steps, of course; memory operations need to fetch or store data, for instance. This may add one or two steps to the process. For non-memory instructions, these steps would just do nothing. Since all instructions under this scheme take the same amount of time to execute, it doesn't matter that some steps do nothing: each instruction's latency is hidden by the fact that you can have at least as many instructions in flight as there are pipeline stages. As a result, as long as the pipeline stays full, one instruction completes every cycle, so the observed cost appears to be one cycle per instruction.

The canonical set of pipeline stages seems to have settled on the following steps:

Instruction fetch.
Operand fetch.
Execution.
Memory Access.
Write-back.
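The latency-versus-throughput point can be checked with a little arithmetic. The Python sketch below models an ideal five-stage pipeline with no stalls or hazards, which is an assumption; the stage abbreviations and function names are mine.

```python
STAGES = ["IF", "OF", "EX", "MEM", "WB"]  # abbreviations for the five stages above

def pipeline_schedule(instructions, stages=STAGES):
    """Map each cycle (1-based) to the (stage, instruction) pairs active in it."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(stages):
            cycle = i + s + 1  # instruction i occupies stage s during this cycle
            schedule.setdefault(cycle, []).append((stage, instr))
    return schedule

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to finish n instructions in an ideal pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction takes n_stages cycles; each later one finishes
    # exactly one cycle after its predecessor.
    return n_stages + n_instructions - 1
```

Under these assumptions, `total_cycles(100)` is 104: every instruction still spends five cycles in the pipeline, but on average one completes roughly every 1.04 cycles.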

But for something like a 65816 processor, which has some pretty hairy addressing modes, you might need a lot more pipeline stages. Let's look at an extreme case for this processor: ADC (3,S),Y. Ignoring the carry flag, this would be effectively equivalent in C to a = a + *((uint8_t *)(y + *((uint16_t *)(s + 3)))); note that the value fetched from the stack is a 16-bit pointer, not a byte.
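To make the address arithmetic explicit, here is a small Python model of the (sr,S),Y effective-address calculation. It is a sketch under simplifying assumptions: memory is a flat 64 KiB bytearray, bank registers and bank wrapping are ignored, the accumulator is 8 bits wide, and the carry that ADC adds in is not modeled. The function names are hypothetical. The model reads a 16-bit, little-endian pointer from the stack, as the addressing mode requires.

```python
def read16(mem, addr):
    """Little-endian 16-bit read from a flat 64 KiB memory."""
    return mem[addr & 0xFFFF] | (mem[(addr + 1) & 0xFFFF] << 8)

def ea_sr_s_y(mem, s, y, offset):
    """Effective address for the (offset,S),Y addressing mode."""
    pointer = read16(mem, s + offset)   # first index: S + offset, fetch pointer
    return (pointer + y) & 0xFFFF       # second index: pointer + Y

def adc_sr_s_y(mem, a, s, y, offset):
    """8-bit add of the addressed byte to A; carry in and out are not modeled."""
    operand = mem[ea_sr_s_y(mem, s, y, offset)]
    return (a + operand) & 0xFF
```

Each line of `ea_sr_s_y` corresponds to one add-then-access pair, which is why this addressing mode needs two rounds of indexing before the ALU ever sees the operand.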

This means we'd need the following pipeline stages to pull something like this off:

Instruction fetch.
Index 1 operand fetch.
Index 1 addition.
Index 1 memory access / Index 2 operand fetch.
Index 2 addition.
Index 2 memory access / Register operand fetch.
ALU transformation.
Memory access.
Register writeback.

The longer pipeline is more difficult to keep full (especially considering that the average basic block size is something like 8 instructions for conventionally written software). It also means that some of those pipeline stages will sit there doing nothing but passing instructions along, as it's generally quite rare to use the (sr,S),Y addressing mode. However, the index-2 stages would see frequent use, as memory-to-memory operations are extremely common on the 6502/65816 architecture. It's not uncommon to unroll loops like this:

    ; Merge screen contents into frame buffer
    LDX #$00
1$: LDA $0400,X
    AND andMask,X
    ORA orMask,X
    STA $0400,X
    LDA $0500,X
    AND andMask+256,X
    ORA orMask+256,X
    STA $0500,X
    LDA $0600,X
    AND andMask+512,X
    ORA orMask+512,X
    STA $0600,X
    LDA $0700,X
    AND andMask+768,X
    ORA orMask+768,X
    STA $0700,X
    INX
    BNE 1$
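As a rough, back-of-the-envelope illustration of the pipeline-fill concern, the following Python sketch estimates the average cycles per instruction when the pipeline must refill after every basic block. It assumes, pessimistically and purely for illustration, that every block ends in a branch that empties the pipeline completely, with no branch prediction; `cycles_per_instruction` is a hypothetical helper.

```python
def cycles_per_instruction(n_stages, block_size):
    """Average CPI if the pipeline drains and refills once per basic block."""
    # The first instruction of the block needs n_stages cycles to complete;
    # every remaining instruction completes one cycle after its predecessor.
    cycles = n_stages + block_size - 1
    return cycles / block_size
```

With 8-instruction blocks, a five-stage pipeline averages 12/8 = 1.5 cycles per instruction under this model, while a nine-stage pipeline averages 16/8 = 2.0. The 18-instruction body of the unrolled loop above fares better on nine stages, at 26/18, or about 1.44, which hints at why unrolling helps keep a long pipeline busy.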

Don't let utilization factors overly concern you, however; the MC68040 processor, for example, had two of its six pipeline stages dedicated to effective address calculation and fetching, contributing to a consistent 2x to 3x performance benefit over its competitor, the Intel 80486. As long as the idle stages aren't egregious in number, you should be OK in practice.